Principal Component Analysis


PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is performed either using the EVD of the covariance matrix or the SVD of the mean-centered data matrix.

Prerequisites

The reader should be familiar with linear algebra and statistics concepts.

Competences

The reader should be able to perform PCA on a given data set.

References

For more details see

Definitions

A data matrix is a matrix $X\in\mathbb{R}^{m\times n}$, where each column corresponds to a feature (say, a certain gene), and each row corresponds to an observation (say, an individual).

The mean of a vector $x\in\mathbb{R}^{n}$ is $\mu(x)=\displaystyle \frac{x_1+x_2+\cdots + x_n}{n}$.

The standard deviation of a vector $x$ is $\sigma(x)=\displaystyle \sqrt{\frac{\sum_{i=1}^n (x_i-\mu(x))^2}{n-1}}$. The variance of a vector $x$ is $var(x)=\sigma^2(x)$.

The vector of means of a data matrix $X$ is the row-vector of means of the columns of $X$, $\mu(X)=\begin{bmatrix}\mu(X_{:,1}) & \mu(X_{:,2}) & \ldots & \mu(X_{:,n})\end{bmatrix}$.

The zero-mean centered data matrix is the matrix $\bar X$ obtained from a data matrix $X$ by subtracting from each column its mean, $$ \bar X= \begin{bmatrix} X_{:,1}-\mu(X_{:,1}) & X_{:,2}-\mu(X_{:,2}) & \cdots & X_{:,n}-\mu(X_{:,n}) \end{bmatrix}\equiv X-\mathbf{1}\mu(X), $$ where $\mathbf{1}=[1,1,\ldots,1]^T$.

The covariance matrix of a data matrix $X\in\mathbb{R}^{m\times n}$ is the matrix $$ \mathop{\mathrm{cov}}(X)=\displaystyle \frac{1}{m-1}[X-\mathbf{1}\mu(X)]^T[X-\mathbf{1}\mu(X)] \equiv \frac{\bar X^T \bar X}{m-1}, $$ where $m$ is the number of observations (rows).
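
Here is a minimal sketch, assuming a small random data matrix (the variable names μX, Xc and C are illustrative only), which checks these definitions against the Statistics standard library:

using LinearAlgebra, Statistics, Random
Random.seed!(123)
m,n=6,3
X=rand(m,n)              # rows are observations, columns are features
μX=mean(X,dims=1)        # vector of means (1×n row-vector)
Xc=X.-μX                 # zero-mean centered data matrix X̄
C=Xc'*Xc/(m-1)           # covariance matrix from the definition
C≈cov(X)                 # should return true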

Facts

Given a data matrix $X$, let $\mathop{\mathrm{cov}}(X)=U\Lambda U^T$ be the EVD with non-increasingly ordered eigenvalues, $\lambda_1\geq \lambda_2\geq \cdots \geq \lambda_n$.

  1. $\mathop{\mathrm{cov}}(X)$ is a symmetric positive semidefinite (PSD) matrix.

  2. $\mathop{\mathrm{cov}}(X)=\mathop{\mathrm{cov}}(\bar X)$.

  3. Let $T=\bar X U$. The columns of $T$ are the principal components of $\bar X$. In particular:

    1. The first principal component of $\bar X$ (or $X$) is the first column, $T_{:,1}$. It is a projection of the zero-mean centered data set $\bar X$ on the line defined by $U_{:,1}$. This is the direction along which the data have the largest variance.
    2. The second column (the second principal component), $T_{:,2}$, is the projection of $\bar X$ on the line defined by $U_{:,2}$, which is orthogonal to the first one. This is the direction with the largest variance after subtracting the first principal component from $\bar X$.
    3. The $k$-th principal component is the direction with the largest variance after subtracting the first $k-1$ principal components from $\bar X$, that is, the first principal component of the matrix $$ \hat X=\bar X-\sum_{i=1}^{k-1} \bar X U_{:,i} U_{:,i}^T. $$
  4. Let $\bar X=\bar U \Sigma V^T$ be the SVD of $\bar X$. Then $V=U$ (up to the signs of the columns) and, with that choice of signs, $T=\bar U\Sigma V^T V\equiv \bar U \Sigma$.

  5. Reconstruction of the data from the principal components is as follows:

    1. Full reconstruction is $X=T U^T +\mathbf{1} \mu(X)$.
    2. Reconstruction from the first $k$ principal components is $\tilde X =T_{:,1:k} U_{:,1:k}^T +\mathbf{1} \mu(X)$ (a numerical check of Facts 2, 4 and 5 is sketched after this list).
  6. Partial reconstructions can help in obtaining various insights into the data. For example, the rows of the matrix $T_{:,1:k}$ can be clustered by the $k$-means algorithm, and the points defined by the first three columns of $T$ can be plotted to visualize the projections of the clusters. Afterwards, the computed clusters can be mapped back to the original data.

  7. A heuristic guess for the number of important clusters is given by the location of the "knee" in the plot of the singular values of $\bar X$.
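
The following is a minimal numerical check of Facts 2, 4 and 5, a sketch only, using a small random data matrix and illustrative variable names:

using LinearAlgebra, Statistics, Random
Random.seed!(123)
m,n=8,3
X=rand(m,n)                        # rows are observations
μX=mean(X,dims=1)
Xc=X.-μX                           # centered data X̄
λ,U=eigen(Symmetric(cov(X)))       # eigen() returns eigenvalues in ascending order
λ=reverse(λ); U=U[:,end:-1:1]      # reorder non-increasingly, as in the Facts
T=Xc*U                             # principal components (Fact 3)
Ū,σ,V=svd(Xc)
println(cov(Xc)≈cov(X))            # Fact 2
println(λ≈σ.^2/(m-1))              # eigenvalues of cov(X) are σᵢ²/(m-1)
println(abs.(U'*V)≈Matrix(I,n,n))  # Fact 4: V equals U up to column signs
println(abs.(T)≈abs.(Ū*Diagonal(σ)))  # Fact 4: T = ŪΣ up to column signs
println(X≈T*U' .+ μX)              # Fact 5: full reconstruction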

Example - Elliptical data set

We generate a quasi-elliptical set of points and compute its principal components.


In [2]:
# Generate data points
using Random
Random.seed!(456)
n=3
m=500
ax=[8,3,1]
X=Array{Float64}(undef,m,n)
for i=1:n
    X[:,i]=rand!(X[:,i])*ax[i]
end
# Parameters
u=(rand(m).-0.5)*π
v=(rand(m).-0.5)*2*π
for i=1:m
    X[i,1]=X[i,1]*cos(u[i])*cos(v[i])
    X[i,2]=X[i,2]*cos(u[i])*sin(v[i])
    X[i,3]=X[i,3]*sin(u[i])
end

In [3]:
X0=copy(X)


Out[3]:
500×3 Array{Float64,2}:
  0.210576    -0.127732    0.776479   
  0.591569    -0.876754    0.262941   
  3.71806     -0.0563136  -0.024527   
 -3.22498     -0.652015   -0.0999344  
 -0.00225668   0.184277   -0.343653   
  0.592944     2.66002     0.0073981  
 -0.0503422   -0.0952142  -0.00904418 
  0.370151    -0.153789   -0.264933   
 -3.27693      0.339955    0.404337   
 -2.30723      0.751473    0.399126   
  2.48328     -0.579145   -0.10629    
  0.460629    -0.0970859   0.354729   
  3.38002      0.760876   -0.111974   
  ⋮                                   
 -5.24438     -0.121456    0.0163747  
 -3.93267      0.63387    -0.0962625  
 -1.26243      0.0279709   0.248269   
 -0.0406195   -0.558751   -0.230258   
  2.57882      0.0022758   0.141931   
  4.36542      0.94466    -0.420983   
 -0.55316      0.142784    0.00606709 
 -2.59395      1.04655    -0.724932   
  0.157457    -0.60424     0.664749   
 -4.03607     -0.702379   -0.127965   
 -0.682568     1.11439    -0.181113   
 -1.60667     -1.75284     0.000999079

In [4]:
using PyPlot

In [5]:
plot3D(X0[:,1],X0[:,2],X0[:,3],"+")


Out[5]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x0000000025F3DDD8>

In [6]:
# Compute the means. How good is the RNG?
using Statistics
μ0=mean(X,dims=1)


Out[6]:
1×3 Array{Float64,2}:
 -0.0920898  0.0422521  -0.0313562

In [7]:
# Subtract the means
using LinearAlgebra
X=X.-μ0
# Rotate by a random orthogonal matrix
Q,r=qr(rand(3,3))
X=X*Q;

In [8]:
# Translate
S=[3,-2,4]


Out[8]:
3-element Array{Int64,1}:
  3
 -2
  4

In [9]:
X.+=S'
# plot3D(X0[:,1],X0[:,2],X0[:,3],"b+")
plot3D(X[:,1],X[:,2],X[:,3],"rx" )


Out[9]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x0000000022D9A5F8>

In [10]:
C=cov(X)


Out[10]:
3×3 Array{Float64,2}:
 0.619018    0.0889886   0.243518
 0.0889886   3.36625    -2.70555 
 0.243518   -2.70555     2.65804 

In [11]:
μ=mean(X,dims=1)


Out[11]:
1×3 Array{Float64,2}:
 3.0  -2.0  4.0

In [12]:
# Fact 2
cov(X.-μ), (X.-μ)'*(X.-μ)/(m-1)


Out[12]:
([0.619018 0.0889886 0.243518; 0.0889886 3.36625 -2.70555; 0.243518 -2.70555 2.65804], [0.619018 0.0889886 0.243518; 0.0889886 3.36625 -2.70555; 0.243518 -2.70555 2.65804])

In [13]:
# Principal components: eigen() returns eigenvalues in ascending order, so the largest is last
λ,U=eigen(C)


Out[13]:
Eigen{Float64,Float64,Array{Float64,2},Array{Float64,1}}
eigenvalues:
3-element Array{Float64,1}:
 0.15669814787095282
 0.7441175461904636 
 5.742492308693448  
eigenvectors:
3×3 Array{Float64,2}:
  0.464569  0.885347   0.0183332
 -0.578243  0.318972  -0.750928 
 -0.670679  0.338257   0.66013  

In [14]:
# Largest principal component
T1=(X.-μ)*U[:,3]
subplot(221)
plot(T1,zero(T1),"rx")
subplot(222)
plot(T1,"b+")


Out[14]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000022E45908>

In [15]:
# Two largest principal components
T2=(X.-μ)*U[:,[3,2]]
axis("equal")
plot(T2[:,1],T2[:,2],"r+")


Out[15]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000026614860>

In [16]:
# All  three principal components
T=(X.-μ)*U[:,[3,2,1]]
plot3D(T[:,1],T[:,2],T[:,3],"rx" )


Out[16]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BE7FBA8>

In [17]:
# Fact 5 - Recovery of the largest component
Y1=T1*U[:,3]'.+μ
plot3D(Y1[:,1],Y1[:,2],Y1[:,3],"rx" )


Out[17]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BEC5588>

In [18]:
# Recovery of the two largest components
Y2=T2*U[:,[3,2]]'.+μ
plot3D(Y2[:,1],Y2[:,2],Y2[:,3],"rx" )


Out[18]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BF43320>

In [19]:
# Recovery of all three components (exact)
Y3=T*U[:,[3,2,1]]'.+μ
plot3D(Y3[:,1],Y3[:,2],Y3[:,3],"rx" )
plot3D(X[:,1],X[:,2],X[:,3],"b+" )


Out[19]:
1-element Array{PyCall.PyObject,1}:
 PyObject <mpl_toolkits.mplot3d.art3d.Line3D object at 0x000000002BFB7E10>

In [20]:
# Fact 4 - PCA using SVD
function myPCA(X::Array{T}, k::Int) where T
    μ=mean(X,dims=1)
    U,σ,V=svd(X.-μ)
    U[:,1:k]*Diagonal(σ[1:k])
end


Out[20]:
myPCA (generic function with 1 method)

In [21]:
T1s=myPCA(X,1)
[T1s T1]


Out[21]:
500×2 Array{Float64,2}:
 -0.299962   -0.299962 
 -0.677274   -0.677274 
 -3.80944    -3.80944  
  3.13705     3.13705  
 -0.0913534  -0.0913534
 -0.701393   -0.701393 
 -0.0408378  -0.0408378
 -0.46147    -0.46147  
  3.18378     3.18378  
  2.21151     2.21151  
 -2.57156    -2.57156  
 -0.551052   -0.551052 
 -3.47671    -3.47671  
  ⋮                    
  5.1533      5.1533   
  3.83665     3.83665  
  1.17097     1.17097  
 -0.048093   -0.048093 
 -2.67025    -2.67025  
 -4.46387    -4.46387  
  0.460503    0.460503 
  2.4941      2.4941   
 -0.244074   -0.244074 
  3.94838     3.94838  
  0.583425    0.583425 
  1.52589     1.52589  

In [22]:
# The two largest components using SVD
T2s=myPCA(X,2)
[T2s T2]


Out[22]:
500×4 Array{Float64,2}:
 -0.299962   -0.160003    -0.299962    0.160003  
 -0.677274   -0.918847    -0.677274    0.918847  
 -3.80944    -0.122284    -3.80944     0.122284  
  3.13705    -0.67559      3.13705     0.67559   
 -0.0913534   0.136863    -0.0913534  -0.136863  
 -0.701393    2.61372     -0.701393   -2.61372   
 -0.0408378  -0.137383    -0.0408378   0.137383  
 -0.46147    -0.202334    -0.46147     0.202334  
  3.18378     0.323977     3.18378    -0.323977  
  2.21151     0.729302     2.21151    -0.729302  
 -2.57156    -0.638525    -2.57156     0.638525  
 -0.551052   -0.137113    -0.551052    0.137113  
 -3.47671     0.695634    -3.47671    -0.695634  
  ⋮                                              
  5.1533     -0.130762     5.1533      0.130762  
  3.83665     0.614611     3.83665    -0.614611  
  1.17097    -0.00285697   1.17097     0.00285697
 -0.048093   -0.604167    -0.048093    0.604167  
 -2.67025    -0.0541343   -2.67025     0.0541343 
 -4.46387     0.868697    -4.46387    -0.868697  
  0.460503    0.103952     0.460503   -0.103952  
  2.4941      1.00964      2.4941     -1.00964   
 -0.244074   -0.637758    -0.244074    0.637758  
  3.94838    -0.721286     3.94838     0.721286  
  0.583425    1.0735       0.583425   -1.0735    
  1.52589    -1.78491      1.52589     1.78491   

Example - Real data

We will cluster three data sets from the workshop "Principal Manifolds-2006":

Data set    # of genes (m)    # of samples (n)
D1          17816             286
D2          3036              40
D3          10383             103

In [23]:
# Data set D1
using DelimitedFiles
f=readdlm("files/d1.txt")


Out[23]:
17817×287 Array{Any,2}:
 "CHIP"                        "GSM36777"  …    "GSM37061"    "GSM37062"
 "1007_s_at"                  0.12             0            -0.43       
 "1053_at"                    0.14             0.46         -0.2        
 "117_at"                    -0.14            -0.8          -0.29       
 "121_at"                    -0.11             0.21          0.24       
 "1255_g_at"                 -0.73         …   0.52          0.72       
 "1294_at"                   -0.26            -0.13         -1.06       
 "1316_at"                   -0.05            -0.55         -0.05       
 "1405_i_at"                 -0.85            -2.92         -0.73       
 "1431_at"                   -0.67             0.67          0.58       
 "1438_at"                   -0.49         …   0.87         -1.63       
 "1487_at"                   -0.47            -0.11          0.32       
 "1494_f_at"                  3.97            -0.61         -0.17       
 ⋮                                         ⋱   ⋮                        
 "AFFX-r2-Ec-bioC-3_at"      -0.59         …   0.03         -0.27       
 "AFFX-r2-Ec-bioC-5_at"      -0.75            -0.01         -0.44       
 "AFFX-r2-Ec-bioD-3_at"      -0.59             0.01         -0.43       
 "AFFX-r2-Ec-bioD-5_at"      -0.7              0.14         -0.39       
 "AFFX-r2-Hs18SrRNA-3_s_at"   0.23            -1.98          0.23       
 "AFFX-r2-Hs18SrRNA-5_at"     1.2          …  -1.8           0.2        
 "AFFX-r2-Hs18SrRNA-M_x_at"   0.57            -0.86          0.61       
 "AFFX-r2-Hs28SrRNA-3_at"     1.01            -0.94          0.21       
 "AFFX-r2-Hs28SrRNA-5_at"     0.47            -0.85          0.54       
 "AFFX-r2-Hs28SrRNA-M_at"     0.63            -1.18          0.59       
 "AFFX-r2-P1-cre-3_at"       -0.46         …   0.09         -0.26       
 "AFFX-r2-P1-cre-5_at"       -0.52             0.06         -0.28       

In [24]:
sizeof(f)


Out[24]:
40907832

In [25]:
X=map(Float64,f[2:end,2:end])


Out[25]:
17816×286 Array{Float64,2}:
  0.12   0.88   0.58   0.19   0.27  …  -0.2   -0.35  -0.24   0.0   -0.43
  0.14  -0.88  -0.22   0.94   1.01      0.36   0.02  -0.37   0.46  -0.2 
 -0.14  -0.31   0.2    0.41   0.48      0.56  -0.05   0.1   -0.8   -0.29
 -0.11   0.94   0.15   0.02  -0.06      0.05  -0.02   0.12   0.21   0.24
 -0.73   1.46  -0.32   1.03   0.39     -0.17  -0.23   1.47   0.52   0.72
 -0.26   0.28  -0.24  -0.21  -0.01  …  -0.18   0.38   0.01  -0.13  -1.06
 -0.05   0.6    0.0    0.04   0.02     -0.41  -0.6    0.0   -0.55  -0.05
 -0.85  -0.31  -2.45   0.69   2.71      0.05   0.75  -0.33  -2.92  -0.73
 -0.67   0.63   0.11   0.19   0.28     -0.83  -0.01  -0.1    0.67   0.58
 -0.49  -0.98  -0.92   2.08   1.26      0.16  -0.23  -0.19   0.87  -1.63
 -0.47   0.08   0.13   0.09   0.67  …   0.0    0.14   0.5   -0.11   0.32
  3.97   0.52   4.93  -0.25   0.16     -0.54  -0.45   0.16  -0.61  -0.17
  0.11   0.37  -0.38  -0.23  -0.72      0.43   0.41  -0.49  -0.28  -0.3 
  ⋮                                 ⋱                               ⋮   
 -0.59   0.36   0.24   0.54   0.22      0.29   0.04   0.44   0.03  -0.27
 -0.75  -0.02   0.32   0.42   0.23  …   0.22  -0.09   0.62  -0.01  -0.44
 -0.59   0.25   0.4   -0.06  -0.14      0.36   0.13   0.74   0.01  -0.43
 -0.7    0.06   0.46  -0.07   0.03      0.39   0.1    0.64   0.14  -0.39
  0.23  -0.01   1.81   0.64  -1.27     -0.32  -0.57   0.1   -1.98   0.23
  1.2   -0.02   2.71   0.48  -4.13     -0.39  -1.04  -0.56  -1.8    0.2 
  0.57   0.72   2.45   0.61  -2.12  …  -0.72  -0.21  -0.13  -0.86   0.61
  1.01  -0.4    2.13   0.98  -1.13     -0.18  -0.62   0.22  -0.94   0.21
  0.47   0.19   1.86   0.8   -0.7      -0.37   0.27  -0.07  -0.85   0.54
  0.63   0.75   1.48   0.91  -1.16      0.06  -0.37   0.32  -1.18   0.59
 -0.46   0.25   0.42   0.45   0.33      0.34   0.13   0.6    0.09  -0.26
 -0.52   0.15   0.47   0.35   0.15  …   0.23   0.09   0.67   0.06  -0.28

In [26]:
# Clear a big variable
f=[]


Out[26]:
0-element Array{Any,1}

In [27]:
# Fact 7 - Plot σ's and observe the knee
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)


Out[27]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000001CAB2B0>

In [28]:
# PCA on X, keep 20 singular values
k=20
T=myPCA(X,k)


Out[28]:
17816×20 Array{Float64,2}:
 -2.60997     1.29427    3.82777   …   0.416223   -0.908852    -0.339937 
  2.53162     6.72032    1.25924       0.950754   -0.53449      0.0961831
  2.63868     2.01813    2.13187       1.72483     0.513846    -0.0914956
 -0.631409   -0.94409    1.88523       0.440639   -0.745012     0.0996248
 -0.8162      0.174124   3.59941       1.30466    -2.34368      0.369603 
 -0.274186   -1.29081   -1.33486   …  -0.473495   -0.62421     -1.26462  
  0.401881   -0.738539   1.34847       0.39025    -0.00439161   0.0151878
 21.7269     -4.06805   -9.91123      -2.0272     -1.41633     -0.0263997
 -2.58371    -0.156413  -0.737911     -0.265203    1.58469      0.322162 
  4.17899    17.0184     1.83096       2.91487    -0.433558     1.164    
  0.772371    0.155414   1.66253   …  -0.833175    0.397829    -0.700775 
 -5.42643    -0.788577   1.64396      -1.44325     0.387156     0.840406 
  0.500175   -4.16445   -3.51728       0.221502    0.00109187  -1.79048  
  ⋮                                ⋱                                     
  0.602363   -1.58106    1.88832      -0.291308   -0.863038    -1.15118  
  0.539476   -1.8133     2.05333   …  -0.31821    -1.60131     -1.2452   
  0.485803   -1.87352    2.20944      -0.164564   -0.654407    -1.2894   
  0.475569   -1.44394    2.10787      -0.344536   -1.242       -1.30914  
  5.56309    -2.43271    4.40758      -2.06542     0.563161    -1.10901  
  5.82127     1.07777    2.81812      -2.12045    -0.50111     -0.404985 
  4.46274    -2.24485    4.53101   …  -1.62547    -0.859139    -0.507143 
  4.32701     1.4561     3.72729      -1.62872    -0.24935     -0.80077  
  3.20833    -3.0066     3.52118      -1.18471    -0.67259     -0.423825 
  3.35449    -3.69571    2.98001      -1.24199     0.215023    -0.144022 
 -0.0676899  -1.15867    2.49808      -0.213136   -1.55336     -0.767101 
  0.431614   -1.23256    2.16338   …   0.0266245  -1.6841      -0.930856 

In [29]:
# Find k clusters
using Clustering
out=kmeans(T',k)


Out[29]:
KmeansResult{Array{Float64,2},Float64,Int64}([-6.75249 -0.662165 … -1.1111 0.0884205; -0.592097 0.354341 … 0.121346 -2.65327; … ; 0.223095 0.110968 … 0.655065 0.422075; 0.0910122 -0.0124653 … 0.315076 -0.348167], [11, 10, 11, 2, 2, 17, 2, 3, 2, 5  …  17, 17, 9, 9, 9, 9, 9, 9, 2, 17], [16.837, 36.2713, 25.6976, 15.6004, 26.6287, 26.4335, 10.5758, 294.958, 21.645, 145.355  …  45.5321, 65.0738, 71.2243, 105.03, 94.0346, 57.3524, 58.6991, 64.8662, 55.6902, 52.3087], [908, 2867, 575, 1846, 423, 174, 361, 451, 430, 1330, 1884, 67, 109, 1760, 1353, 696, 1444, 80, 654, 404], [908, 2867, 575, 1846, 423, 174, 361, 451, 430, 1330, 1884, 67, 109, 1760, 1353, 696, 1444, 80, 654, 404], 877003.9289306196, 80, true)

In [30]:
# Plot points defined by the first three principal components
function myPlot(C::Vector, T::Array, k::Int)
    P=Array{Any}(undef,k)
    for j=1:k
        P[j]=T[C.==j,1:3]
        plot3D(P[j][:,1],P[j][:,2],P[j][:,3],"x")
    end
end


Out[30]:
myPlot (generic function with 1 method)

In [31]:
myPlot(out.assignments,T,k)



In [32]:
# Data set D2
X=readdlm("files/d2fn.txt")


Out[32]:
3036×40 Array{Float64,2}:
  0.0681   0.072    0.3944  -0.0606  …  -0.3723  -0.2963  -0.619   -0.4171
  0.1402   0.2517   0.0843   0.1794     -0.1204   0.0086  -0.8177  -0.3646
  0.0081   0.3179   0.5802   0.0162     -0.2968  -0.3981  -1.055   -0.4583
  0.1049   0.3031   0.3048   0.2214     -0.288   -0.0841  -0.7231  -0.2951
  0.0609   0.4262   0.104    0.227      -0.3255  -0.1652  -0.6634  -0.4358
 -0.1163   0.2005  -0.2816   0.1192  …  -0.2508  -0.034   -0.5112  -0.1964
  0.1276   0.4596   0.3751   0.2379     -0.4025  -0.3775  -0.6091  -0.4265
  0.1362   0.4215   0.0645   0.3017     -0.3727  -0.3352  -0.642   -0.4601
  0.1455   0.3676   0.1463   0.2997     -0.4893  -0.5417  -0.6408  -0.5108
 -0.7093   0.6045  -0.6372  -0.6594     -0.1644  -0.0441   0.4404   0.4663
 -0.0044   0.1012   0.1362   0.432   …  -0.1594   0.4027  -0.5591  -0.1365
  0.1288   0.6568   0.7199   0.2173     -0.2006   0.0184  -1.1106   0.1015
  0.5014   0.6976   0.9516   0.6017     -1.168   -2.0287  -1.0887  -0.5792
  ⋮                                  ⋱                                    
  0.2222   0.0015   0.0284  -0.3749      0.2817  -0.4384   0.2771   0.7985
  0.2587   0.1443  -0.1444   0.4272  …  -0.4435  -0.6569  -0.6596  -0.4899
  0.3712   0.3939   0.4392   0.4054     -0.7037  -0.9145  -0.8822  -0.7753
  0.3355   1.0494   1.1719   0.5783     -0.7814  -1.0491  -0.7743   0.041 
  0.3646   0.5563   0.375    0.3172     -0.5824  -0.7064  -0.8294  -0.8036
  0.4069   0.7489   0.1103   0.0303     -0.4834  -0.6479  -0.7976  -0.431 
 -0.5085   0.5377   0.6047  -0.5306  …  -0.2162  -1.8589   0.2188   0.2223
  0.2777  -0.403   -0.0952   0.2562     -0.303    0.5055  -0.8516   0.2553
 -0.0092  -0.1585   0.156   -0.6566      0.1016  -0.2363  -0.0053  -0.3698
  0.1618   0.0412  -0.3044   0.0847      0.5937  -0.0567   0.9136  -0.5945
  0.3653  -0.1373   0.0607   0.2313      0.3029   0.6607  -0.2687  -0.7253
  0.1646  -0.3546  -0.2455  -0.0452  …  -0.2054   0.2651  -0.3522  -0.7788

In [33]:
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)


Out[33]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x0000000001D84A90>

In [34]:
k=4
T=myPCA(X,k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)


We read the data with the package CSV.jl, replacing the string "NULL" with missing.


In [35]:
using CSV

In [36]:
sf = CSV.read("files/d3efn.txt", header=0,missingstring="NULL")


Out[36]:

10,387 rows × 103 columns (omitted printing of 95 columns)
All columns are of type Float64⍰ (Union{Missing, Float64}); entries read as "NULL" appear as missing. (Tabular output abbreviated.)

In [37]:
# sf[:,103]

In [38]:
typeof(sf)


Out[38]:
DataFrames.DataFrame

In [39]:
# Set missing values to zero
X=coalesce.(Matrix(sf[:,1:end-1]),0.0)


Out[39]:
10387×102 Array{Float64,2}:
  0.179  -0.201   1.289   0.039   0.869  …  -0.961  -0.821  -0.551   1.279
  1.042   0.162   0.612   0.472   0.382     -0.788  -0.598  -0.078   0.0  
 -0.022  -0.482   0.0    -0.182   0.0        0.688   0.758  -0.002   0.108
 -0.597   0.183   0.593  -0.467   0.273      0.253  -0.187  -0.707   0.393
  0.699   0.639   0.739   0.349   0.0       -0.711  -3.391   0.769   1.519
  1.035  -3.015  -2.255  -2.725  -2.095  …  -1.235  -3.185  -0.895   1.845
  0.64   -4.9    -4.24   -4.34   -3.39      -1.04   -1.72   -1.41    2.62 
  0.904  -2.636  -2.446  -1.996   0.0       -0.986  -0.986  -0.976   1.114
  0.74   -2.15    0.0    -3.43    0.0       -1.17   -0.95   -1.78    1.87 
  0.771   0.0    -1.389  -1.249   0.0       -1.829   0.0    -1.049   0.391
  0.351  -0.829  -1.239  -0.679  -1.089  …  -1.139  -0.279  -0.789   0.441
  0.926  -1.034   0.0    -0.654   0.0        0.726   0.166  -0.104   0.0  
 -0.216  -0.126   0.484   0.024   0.134     -1.206   0.074  -0.226   1.434
  ⋮                                      ⋱                   ⋮            
 -0.995  -0.945   0.0     0.0    -0.145  …   0.205   0.0     0.295   0.0  
 -2.323  -2.043  -1.213  -2.633  -0.993      0.857  -0.743  -1.683   0.0  
  1.511   0.681   1.601   0.961   2.641      1.141   0.401  -1.119   0.521
  1.007   0.197   1.057   0.547   1.657      1.287  -0.043  -0.213   0.547
  0.968   0.688   1.278   0.838   1.038      2.228  -0.102  -1.342  -1.232
  1.346   1.396   2.616   1.506   2.366  …  -1.834   0.186  -0.004  -5.214
 -0.186   1.114   0.814   0.274   0.0        0.614  -0.266   0.204  -1.236
 -0.985  -1.715  -1.405  -1.605  -0.425      0.855   6.205  -0.155  -1.565
 -0.135  -1.315   0.225  -2.025   0.095      0.275   0.055  -0.605  -1.735
  0.095  -0.845  -0.895   0.0    -0.445      0.885   0.455   0.255  -0.695
  0.173  -0.057   0.0     0.243   0.0    …   0.263   0.143   0.103   0.163
  0.826  -0.054  -1.134   0.436  -0.444     -0.284  -0.244  -0.164  -0.164

In [40]:
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)


Out[40]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x00000000523813C8>

In [41]:
k=20
T=myPCA(X,k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)



In [43]:
# Clustering the transpose (clusters the samples instead of the genes)
k=20
T=myPCA(Matrix(X'),k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)


